import pandas as pd
df = pd.read_csv('lightcast_job_postings.csv')Data Analysis
Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends
#4.1
columns = df.columns.tolist()
for i in range(0, len(columns), 5):
print(columns[i:i+5])['ID', 'LAST_UPDATED_DATE', 'LAST_UPDATED_TIMESTAMP', 'DUPLICATES', 'POSTED']
['EXPIRED', 'DURATION', 'SOURCE_TYPES', 'SOURCES', 'URL']
['ACTIVE_URLS', 'ACTIVE_SOURCES_INFO', 'TITLE_RAW', 'BODY', 'MODELED_EXPIRED']
['MODELED_DURATION', 'COMPANY', 'COMPANY_NAME', 'COMPANY_RAW', 'COMPANY_IS_STAFFING']
['EDUCATION_LEVELS', 'EDUCATION_LEVELS_NAME', 'MIN_EDULEVELS', 'MIN_EDULEVELS_NAME', 'MAX_EDULEVELS']
['MAX_EDULEVELS_NAME', 'EMPLOYMENT_TYPE', 'EMPLOYMENT_TYPE_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE']
['IS_INTERNSHIP', 'SALARY', 'REMOTE_TYPE', 'REMOTE_TYPE_NAME', 'ORIGINAL_PAY_PERIOD']
['SALARY_TO', 'SALARY_FROM', 'LOCATION', 'CITY', 'CITY_NAME']
['COUNTY', 'COUNTY_NAME', 'MSA', 'MSA_NAME', 'STATE']
['STATE_NAME', 'COUNTY_OUTGOING', 'COUNTY_NAME_OUTGOING', 'COUNTY_INCOMING', 'COUNTY_NAME_INCOMING']
['MSA_OUTGOING', 'MSA_NAME_OUTGOING', 'MSA_INCOMING', 'MSA_NAME_INCOMING', 'NAICS2']
['NAICS2_NAME', 'NAICS3', 'NAICS3_NAME', 'NAICS4', 'NAICS4_NAME']
['NAICS5', 'NAICS5_NAME', 'NAICS6', 'NAICS6_NAME', 'TITLE']
['TITLE_NAME', 'TITLE_CLEAN', 'SKILLS', 'SKILLS_NAME', 'SPECIALIZED_SKILLS']
['SPECIALIZED_SKILLS_NAME', 'CERTIFICATIONS', 'CERTIFICATIONS_NAME', 'COMMON_SKILLS', 'COMMON_SKILLS_NAME']
['SOFTWARE_SKILLS', 'SOFTWARE_SKILLS_NAME', 'ONET', 'ONET_NAME', 'ONET_2019']
['ONET_2019_NAME', 'CIP6', 'CIP6_NAME', 'CIP4', 'CIP4_NAME']
['CIP2', 'CIP2_NAME', 'SOC_2021_2', 'SOC_2021_2_NAME', 'SOC_2021_3']
['SOC_2021_3_NAME', 'SOC_2021_4', 'SOC_2021_4_NAME', 'SOC_2021_5', 'SOC_2021_5_NAME']
['LOT_CAREER_AREA', 'LOT_CAREER_AREA_NAME', 'LOT_OCCUPATION', 'LOT_OCCUPATION_NAME', 'LOT_SPECIALIZED_OCCUPATION']
['LOT_SPECIALIZED_OCCUPATION_NAME', 'LOT_OCCUPATION_GROUP', 'LOT_OCCUPATION_GROUP_NAME', 'LOT_V6_SPECIALIZED_OCCUPATION', 'LOT_V6_SPECIALIZED_OCCUPATION_NAME']
['LOT_V6_OCCUPATION', 'LOT_V6_OCCUPATION_NAME', 'LOT_V6_OCCUPATION_GROUP', 'LOT_V6_OCCUPATION_GROUP_NAME', 'LOT_V6_CAREER_AREA']
['LOT_V6_CAREER_AREA_NAME', 'SOC_2', 'SOC_2_NAME', 'SOC_3', 'SOC_3_NAME']
['SOC_4', 'SOC_4_NAME', 'SOC_5', 'SOC_5_NAME', 'LIGHTCAST_SECTORS']
['LIGHTCAST_SECTORS_NAME', 'NAICS_2022_2', 'NAICS_2022_2_NAME', 'NAICS_2022_3', 'NAICS_2022_3_NAME']
['NAICS_2022_4', 'NAICS_2022_4_NAME', 'NAICS_2022_5', 'NAICS_2022_5_NAME', 'NAICS_2022_6']
['NAICS_2022_6_NAME']
#4.2
columns_to_drop = [
"ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
"NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
"SOC_2", "SOC_3", "SOC_5"
]
df.drop(columns=columns_to_drop, inplace=True)4.2: The above columns are to be dropped because they are not relevant in our analysis. For example, ID is not an insightful variable here because it entails little numerical value for our analysis (i.e. different IDs do not represent different meaningful characteristics for the posted job). Other variables such as NAICS4 or NAICS5 will be deleted because they are out-dated.
pd.DataFrame(df.columns, columns=["Remaining Columns"])| Remaining Columns | |
|---|---|
| 0 | LAST_UPDATED_DATE |
| 1 | POSTED |
| 2 | EXPIRED |
| 3 | DURATION |
| 4 | SOURCE_TYPES |
| ... | ... |
| 113 | NAICS_2022_4_NAME |
| 114 | NAICS_2022_5 |
| 115 | NAICS_2022_5_NAME |
| 116 | NAICS_2022_6 |
| 117 | NAICS_2022_6_NAME |
118 rows × 1 columns
Dropping irrelavant columns will make our database more compact and will save us cognitive efforts when we are interpreting graphs, as now we don’t need to interpret the bars or plots for those irrelavant columns. We can therefore focus on more meaningful columns such as the SALARY or industry type.
#4.4
import missingno as msno
import matplotlib.pyplot as plt
msno.heatmap(df)
plt.title("Missing Values Heatmap")
plt.show()
# Drop columns with more than 50% missing values
keep_columns = ["INDUSTRY", "SALARY"]
missing_percent = df.isnull().mean()
cols_to_drop = missing_percent[
(missing_percent > 0.5) & (~missing_percent.index.isin(keep_columns))
].index
df.drop(columns=cols_to_drop, inplace=True)# Fill missing values
for col in df.columns:
if df[col].dtype in ['int64', 'float64']:
df[col] = df[col].fillna(df[col].median())
elif df[col].dtype == 'object':
df[col] = df[col].fillna("Unknown")Here we are handling the missing values in the INDUSTRY, SALARY, NAICS2_NAME, and other columns. We drop all columns with >50% missing values, and for the salary and the NAICS2_NAME (containing industry names) columns, we fill NAs with the median value and ‘Unknown’, respectively.
df.head(3)| LAST_UPDATED_DATE | POSTED | EXPIRED | DURATION | SOURCE_TYPES | SOURCES | TITLE_RAW | BODY | MODELED_EXPIRED | MODELED_DURATION | ... | NAICS_2022_2 | NAICS_2022_2_NAME | NAICS_2022_3 | NAICS_2022_3_NAME | NAICS_2022_4 | NAICS_2022_4_NAME | NAICS_2022_5 | NAICS_2022_5_NAME | NAICS_2022_6 | NAICS_2022_6_NAME | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-09-06 | 2024-06-02 | 2024-06-08 | 6.0 | [\n "Company"\n] | [\n "brassring.com"\n] | Enterprise Analyst (II-III) | 31-May-2024\n\nEnterprise Analyst (II-III)\n\n... | 2024-06-08 | 6.0 | ... | 44 | Retail Trade | 441 | Motor Vehicle and Parts Dealers | 4413 | Automotive Parts, Accessories, and Tire Retailers | 44133 | Automotive Parts and Accessories Retailers | 441330 | Automotive Parts and Accessories Retailers |
| 1 | 2024-08-02 | 2024-06-02 | 2024-08-01 | 18.0 | [\n "Job Board"\n] | [\n "maine.gov"\n] | Oracle Consultant - Reports (3592) | Oracle Consultant - Reports (3592)\n\nat SMX i... | 2024-08-01 | 16.0 | ... | 56 | Administrative and Support and Waste Managemen... | 561 | Administrative and Support Services | 5613 | Employment Services | 56132 | Temporary Help Services | 561320 | Temporary Help Services |
| 2 | 2024-09-06 | 2024-06-02 | 2024-07-07 | 35.0 | [\n "Job Board"\n] | [\n "dejobs.org"\n] | Data Analyst | Taking care of people is at the heart of every... | 2024-06-10 | 8.0 | ... | 52 | Finance and Insurance | 524 | Insurance Carriers and Related Activities | 5242 | Agencies, Brokerages, and Other Insurance Rela... | 52429 | Other Insurance Related Activities | 524291 | Claims Adjusting |
3 rows × 109 columns
#4.5
df = df.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")import plotly.io as pio
pio.renderers.default = 'notebook'#5.1.1
import plotly.express as px
industry_counts = df["NAICS2_NAME"].value_counts()
fig = px.bar(industry_counts,
title="Job Postings by Industry",
labels={'index': 'Industry', 'value': 'Number of Job Postings'},
text_auto=True)
fig.show()5.1.1: The graph of job posting by industry is chosen because we need to know which industries are more ‘eager’ to hiring. This can guide job seekers about which industries to choose when deciding where to send their resumes. The graph shows that the Professional, Scientific, and Technical Services industry dominates in terms of job posted (with more than 22k jobs posted). This implies that more opportunities exist in that industry and job seekers should think about advancing their skills corresponding to the skill requirements in this industry.
#5.1.2
fig = px.box(df,
x="NAICS2_NAME",
y="SALARY",
title="Salary Distribution by Industry",
labels={"Industry": "Industry", "Salary": "Salary"},
points="all")
fig.update_layout(xaxis_tickangle=-45)
fig.show()df.groupby("NAICS2_NAME")["SALARY"].mean().sort_values(ascending=False)NAICS2_NAME
Information 129838.706942
Professional, Scientific, and Technical Services 122545.236573
Accommodation and Food Services 121704.673684
Retail Trade 119727.348942
Manufacturing 118032.224153
Finance and Insurance 117732.917823
Utilities 116621.473846
Construction 115786.905537
Management of Companies and Enterprises 114972.733209
Wholesale Trade 113760.985653
Mining, Quarrying, and Oil and Gas Extraction 113593.815152
Unclassified Industry 113440.233087
Transportation and Warehousing 111254.506798
Administrative and Support and Waste Management and Remediation Services 110401.264970
Agriculture, Forestry, Fishing and Hunting 109806.578947
Health Care and Social Assistance 109014.679035
Other Services (except Public Administration) 105049.230715
Real Estate and Rental and Leasing 103524.830334
Arts, Entertainment, and Recreation 102223.296667
Educational Services 99009.963886
Public Administration 92306.371459
Name: SALARY, dtype: float64
5.1.2: In order to gain a deeper understanding about salaries of jobs in different industries, we need a box plot to show us the salary distributions for various industries. Results indicate that the mean salaries among all the industries are almost at the same level (around 100K). However, jobs in certain industries have the potential to reach substantially more salaries than the average. These industries include Administrative and Support and Waste Management and Remediation Services, Information, and Health Care and Social Assistance. Job seekers may want to look into these industries if they aspire to get higher salaries when they advance in their positions.
fig = px.pie(df, names="REMOTE_TYPE_NAME", title="Remote vs. On-Site Jobs")
fig.show()5.1.3: The remost vs. on-site job chart enables us to know whether the jobs posted allow remote working or require on-site working. The data here does not allow us to see a bigger picture due to the 78.3% None value in the column, but from the remaining data we do see that 17% of jobs allow remote working. This can imply that remote work has gained its momentum, and that job seekers nowadays are faced with both opportunities and challenges brought by remote work. For the opportunities, workers now have more flexibility in working location and are no longer binded by physical offices, meaning that they will save time of daily commute and can have better work-life balance. The presence of AI and computer technology also facilitates remote working by enabling workers to work anywhere with internet connections. However, the downside of remote working is that social interactions among workers will decrease. Team spirits and cooperation will diminish when workers are not present in the same physical space. Future job seekers, as well as hiring companies, need to evaluate the ups and downs of this remote working trend in order to optimize work motivation.